Sophisticated Text Mining System for Extracting and Visualizing Numerical and Named Entity Information from a Large Number of Documents

Authors

  • Masaki Murata
  • Tamotsu Shirado
  • Kentaro Torisawa
  • Masakazu Iwatate
  • Koji Ichii
  • Qing Ma
  • Toshiyuki Kanamaru
Abstract

We have developed a system that semiautomatically extracts numerical and named entity sets from a large number of Japanese documents and creates various kinds of tables and graphs. In our experiments, the system semiautomatically created approximately 300 kinds of graphs and tables at precisions of 0.2–0.8, with only 2 hours of manual preparation, from 2 years of newspaper articles. These articles contained a large quantity of data, far more than could be read or checked manually in such a short time. From this perspective, we concluded that our system is useful and convenient for extracting information from a large number of documents. We have constructed a demonstration system, which we briefly describe in this paper.

Similar resources

Presenting a model for extracting information from text documents, based on text mining, in the field of e-learning

As computer networks become the backbones of science and the economy, enormous quantities of documents become available. Text mining techniques have therefore been used to extract useful information from textual data. Text mining has become an important research area that discovers unknown information, facts, or new hypotheses by automatically extracting information from different written documents. T...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

Efficient Algorithm for Mining on Bio Medical Data for Ranking the Web Pages

Information on the internet is growing rapidly in volume and comes from many different sources. Extracting tuples from HTML pages has been an important issue in various web applications such as web data integration, e-commerce market monitoring, and mashups that repurpose and selectively combine existing web data services. Data mining is the process of analyzing data from different perspectives and...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named entity recognition is an information extraction technique that identifies named entities in a text. Three methods have conventionally been used to extract named entities from a text: rule-based, machine-learning-based, and hybrids of the two. Machine-learning-based methods perform well on Persian if they are trained with good features. To get good performanc...

Publication date: 2008